Lab 1: Intro to R and data analysis

Lecture 1: topics illustrated in class

  • Introduction to R and RStudio
    • Why R?
    • Principles of reproducible analysis with R + RStudio
  • R objects, functions, packages
  • Understanding different types of variables
    • Principles of “tidy data”
  • Descriptive statistics
    • Measures of central tendency, measures of variability (or spread), and frequency distribution
  • Visual data exploration
    • {ggplot2}
  • Foundations of inference

___

Lab 1 datasets

Below are the datasets used in the Practice session:

  • Download a whole subfolder



The first section is a quick review of the installation process.

___

Introduction to R and RStudio

Install

R is available for free for Windows, GNU/Linux, and macOS.

  • To install R, you can go to this link. The latest available release is R 4.3.3 “Angel Food Cake”, released on 2024-02-29, but any (fairly recent) version will do.

If you have previously installed R on your machine, you can check which version you are running by executing this command in R:

# From the R console
base::R.version.string
    # (This is the version on my own machine)
    # [1] "R version 4.2.2 (2022-10-31)"

…or by executing this command in your CLI (Command Line Interface):

# From Terminal/Powershell/bash
R --version
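
Inside an R script, the same check can be made programmatically. The snippet below is a minimal sketch; the 4.0.0 threshold is only an illustrative assumption for “fairly recent”, not a requirement stated for this lab.

```r
# Current R version as an object that can be compared
current_version <- getRversion()
print(current_version)

# Hypothetical minimum version, chosen only for illustration
min_version <- "4.0.0"

if (current_version >= min_version) {
  message("R ", as.character(current_version), " is recent enough for this lab.")
} else {
  message("Consider updating R: found ", as.character(current_version),
          ", expected >= ", min_version)
}
```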

Install RStudio IDE

While not strictly required, it is highly recommended that you also install RStudio to facilitate your work. RStudio Desktop is an Integrated Development Environment (IDE), basically a graphical interface wrapping and interfacing with R (which needs to be installed first).

R, which is a command line driven program, can be executed via its native interface (R GUI), as well as from many other code editors, like VS Code, Sublime Text, Jupyter Notebook, etc. RStudio remains the most widely used by beginners and advanced programmers alike, because of its intuitive and integrated interface.

  • To install RStudio, you can go to this link. The free version contains everything you need.

Managing files and projects

In any analytical endeavor it is very likely that you will handle a collection of files (likely organized in folders, such as input_data, output_data, R_scripts, paper, etc.). RStudio provides a fantastic tool for organizing all the files pertaining to a project, called an “R Project”.

Creating an R Project

An R Project will keep all the files associated with a project (including invisible ones!) organized together – input data, R scripts, analytical results, figures. Besides being common practice, this has the advantage of implicitly setting the “working directory”, which is incredibly important when you need to load or output files, specifying their file path.

In Figure 1 you can see how easy it is just following RStudio prompts:

  • Create a new directory for each project
  • Select parent folder
Figure 1: Creating an R project
  • Notice that the Files tab now shows a file with the extension .Rproj, which tells RStudio that all the files in the folder belong together.
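
A quick way to confirm that the project was opened correctly is to check the working directory from the console. This is a minimal sketch using only base R functions; the output depends on where your own project lives.

```r
# With the .Rproj open, the working directory should be the project folder
project_dir <- getwd()
print(project_dir)

# List any .Rproj file sitting in the working directory
rproj_files <- list.files(path = project_dir, pattern = "\\.Rproj$")
print(rproj_files)
```

If `rproj_files` comes back empty, the session was probably started outside the project.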

Install R packages

An R package is a shareable bundle of functions. Besides the basic built-in functions already contained in the program (i.e. the base package), many useful R functions come in free libraries of code (or packages) written by R’s users. You can find them in different repositories:

  • CRAN (Comprehensive R Archive Network) - the general package repository for R: https://cran.r-project.org/.
  • Bioconductor - a package repository geared towards biostatistics https://www.bioconductor.org/.
  • GitHub https://github.com/ - a website and cloud-based service that helps developers store and manage their code. Here you will find R packages at the development stage, or the newest version of an existing one (it may be less stable!).
  • and more…

Let’s take for example the R package here, a package that helps handle file paths in a reproducible manner. To install it for the first time, open an R session and execute:

From CRAN (stable version)

# Installing (ONLY the 1st time)
utils::install.packages('here')

# OR (same)
install.packages('here')

Here you are actually using a function (install.packages) of a pre-installed package (utils) using the syntax packagename::function_name. This prevents any ambiguity when two packages define a function with the same name, and it also helps you see which package each function comes from.
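
To see why the packagename::function_name syntax removes ambiguity, here is a minimal sketch in which a user-defined function (hypothetical, created only for this illustration) masks stats::sd:

```r
# A user-defined function that happens to share its name with stats::sd()
sd <- function(x) "my own sd()"

x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# The unqualified call finds the user-defined version first
own_result <- sd(x)
print(own_result)

# The qualified call always uses the version from the stats package
pkg_result <- stats::sd(x)
print(pkg_result)

# Remove the masking definition so sd() behaves normally again
rm(sd)
```

The same situation arises when two loaded packages export functions with the same name: the unqualified call uses whichever comes first on the search path, while the qualified call is explicit.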

Once you have installed a package, at every subsequent R session, you will only need to load it, like so:

# Loading a package (at every session)
base::library("here")
# ... or
library(here)

Using the graphical interface

You can also install and update packages using the “Packages” tab on the lower right pane of RStudio.

Screenshot: Install/Update packages from RStudio

From GitHub (testing version)

You can use the package devtools and its function install_github to install the developer’s version of a package from its remote GitHub repository. Let’s try it with a nice little package, paint (which colors the structure of a dataset when printing).

# Installing devtools (ONLY the 1st time)
utils::install.packages('devtools')

# Installing paint from GitHub 
library(devtools)
devtools::install_github("MilesMcBain/paint")

# test paint out
library(paint)
# it will show me the structure of a data.frame like this... 
paint(mtcars)
# ... instead of plain old str()
# (str() prints the structure itself; wrapping it in print() would add a stray NULL)
str(mtcars)

After running devtools::install_github("MilesMcBain/paint"), R may ask whether you want to update related packages; respond in the console by choosing your preferred option.

Help on R package/function

To inquire about a package and/or its functions, type ?name in your console to open the Help page for an exact topic (RStudio shows it in the lower right pane), or ??name to search all installed documentation for a keyword.

# Opening Help page on package/function
?here

??here

Defining (reproducible) file paths: here

It is never good practice to “hard code” the file’s absolute path: most likely this will break your code as soon as you (or someone else) need to run it on a different computer, let alone within a different OS.

So if your code to read & load a file is written like this:

# [NOT REPRODUCIBLE] hard coding your file path  -----------------------

# File path on Mac:
dataset <- readr::read_csv(
  "/Users/testuser/R4biostats/input_data/dataset.csv")
# Same file path on Windows (backslashes must be doubled inside an R string):
dataset <- readr::read_csv(
  "C:\\Users\\testuser\\R4biostats\\input_data\\dataset.csv")

…it won’t work on someone else’s computer since they don’t have that same file structure!

This is where the fantastic here package comes in: it lets you reference file paths in a reproducible manner, anchored on the R Project’s folder as the root.

  1. It lets you use relative paths, i.e. specify the file path relative to the project folder containing project_name.Rproj.
  2. No more “/” vs. “\” issue (where Windows and Linux/macOS differ).
  3. Sub-folder levels are given as separate arguments, separated by “,”.

# [REPRODUCIBLE] reference to file path
library(here)
library(readr)

# Check which folder here treats as the project root
here::here()
    # [1] "/Users/testuser/R4biostats"

# Then define file path as ("subfolder_name", "file_name")
# No "\" or "/" needed!
dataset <- read_csv(here("input_data", "dataset.csv"))
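
Because here() simply returns a character string, the path can be inspected and checked before reading the file. The following is a minimal sketch that assumes the here and readr packages are installed and reuses the example file name from above (the file itself may not exist on your machine):

```r
library(here)

# Build the path once and keep it in a variable
csv_path <- here("input_data", "dataset.csv")
print(csv_path)

# Read the file only if it is really there; otherwise report the missing path
if (file.exists(csv_path)) {
  dataset <- readr::read_csv(csv_path)
} else {
  message("File not found: ", csv_path)
}
```

Checking with file.exists() first turns a cryptic read error into a message that shows exactly which path was tried.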

Make sure we have R packages needed for the Lab

To install the R packages needed for the Lab, open an R session and execute:

# Installing (only the 1st time)
pkg_list <- c("here", "dplyr", "readr")
install.packages(pkg_list)

# Loading a package (at every session)
library("here")
library("dplyr")
library("readr")
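
Since installation is only needed the first time, it can be handy to install just the packages that are still missing. This is a minimal sketch, not the only way to do it; restricting the installation to interactive sessions is a design choice made here so that a scripted run never triggers downloads.

```r
pkg_list <- c("here", "dplyr", "readr")

# TRUE for each package that is already installed
is_installed <- vapply(pkg_list, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- pkg_list[!is_installed]
print(missing_pkgs)

# Install only what is missing, and only when working interactively
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```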

___

R objects, functions, packages

This was discussed in Lecture 1.

Now we will…

Read a dataset into R workspace

Let’s start by loading the file we will work on.

Loading input data in your workspace

Load the dataset autism_data.csv in your workspace with code that won’t break on another machine / OS.

Here is the file path of my .csv file inside R4biostats project folder (~ working directory):

  • practice/data/01_datasets/autism_data.csv
autism_data <- read.csv(file = here::here("practice",
                                          "data",
                                          "01_datasets",
                                          "autism_data.csv"), 
                        header = TRUE, 
                        sep = ",", 
                        na.strings = "?") 
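
To see what the na.strings = "?" argument does, here is a minimal, self-contained sketch on a tiny made-up file. The column names and values are hypothetical and are NOT taken from autism_data.csv; only the read.csv arguments mirror the call above.

```r
# Write a small toy CSV to a temporary file; "?" marks a missing value
toy_file <- tempfile(fileext = ".csv")
writeLines(c("id,age,gender",
             "1,26,female",
             "2,?,male",
             "3,35,?"),
           toy_file)

# Same arguments as above: every "?" is converted to NA on import
toy_data <- read.csv(file = toy_file,
                     header = TRUE,
                     sep = ",",
                     na.strings = "?")

# Inspect the structure and count missing values per column
str(toy_data)
na_counts <- colSums(is.na(toy_data))
print(na_counts)
```

Without na.strings = "?", the age column would be read as text, because "?" cannot be interpreted as a number.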

Understanding different types of variables

Principles of “tidy data”

___

Descriptive statistics

Measures of central tendency, measures of variability (or spread), and frequency distribution

___

Visual data exploration

ggplot2

___

Foundations of inference

___

Lab 1 complete R code

Here you will find the solved problems addressed in Lab 1:

  • as .R file